Harassment and Newcomer Retention (Paper)

Regression analysis notebook for the study of the effect of harassment on newcomer retention in Wikipedia. See the research project page for an overview.


In [1]:
%matplotlib inline
import pandas as pd
from dateutil.relativedelta import relativedelta
import statsmodels.formula.api as sm
import requests
from io import StringIO
import math

Load Data and Take Sample

Pick a harassment threshold from [0.01, 0.425, 0.75, 0.85]. WARNING: we are seeing some very threshold-sensitive results! High thresholds result in harassment having a positive impact on m2 activity. Construct a sample that is the concatenation of a random sample and all users who received harassment in m1.


In [2]:
threshold = 0.425

In [3]:
# Features computed in ./Harassment and Newcomer Retention Data Munging.ipynb
df_random = pd.read_csv("../../data/retention/random_user_sample_features.csv")
df_attacked = pd.read_csv("../../data/retention/attacked_user_sample_features.csv")

In [4]:
# include all harassed newcomers in the sample
df_reg = pd.concat([df_random, df_attacked[df_attacked['m1_num_attack_received_%.3f' % threshold] > 0]])
df_reg = df_reg.drop_duplicates(subset = ['user_id'])
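
Given the threshold sensitivity flagged above, a quick robustness sketch (assuming both feature files contain an m1_num_attack_received_<t> column for every candidate threshold, which is not guaranteed) that rebuilds the sample at each threshold and refits the same specification used for RQ1 below, to see how the harassment coefficient moves:

for t in [0.01, 0.425, 0.75, 0.85]:
    col = 'm1_num_attack_received_%.3f' % t
    # rebuild the sample: random users plus everyone attacked at this threshold
    sample = pd.concat([df_random, df_attacked[df_attacked[col] > 0]])
    sample = sample.drop_duplicates(subset=['user_id'])
    sample['received'] = (sample[col] > 0).astype(int)
    # same specification as the second RQ1 model below, on the raw (pre-rename) column names
    fit = sm.ols("m2_num_days_active ~ m1_num_days_active + received", data=sample).fit()
    print(t, round(fit.params['received'], 3))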

In [5]:
df_reg.shape


Out[5]:
(105492, 40)

In [6]:
df_reg['m1_harassment_received'] = (df_reg['m1_num_attack_received_%.3f' % threshold] > 0).apply(int)
df_reg['m1_harassment_made'] = (df_reg['m1_num_attack_made_%.3f' % threshold] > 0).apply(int)

In [7]:
df_reg['m1_harassment_received'].value_counts()


Out[7]:
0    99943
1     5549
Name: m1_harassment_received, dtype: int64

In [8]:
df_reg.shape


Out[8]:
(105492, 42)

In [9]:
column_map = {
        'm1_num_days_active': 'm1_days_active',
        'm2_num_days_active' : 'm2_days_active',
        'm1_harassment_received': 'm1_received_harassment',
        'm1_harassment_made': 'm1_made_harassment',
        'm1_fraction_ns0_deleted': 'm1_fraction_ns0_deleted',
        'm1_fraction_ns0_reverted': 'm1_fraction_ns0_reverted',
        'm1_num_warnings_recieved': 'm1_warnings',
        }
        
df_reg = df_reg.rename(columns=column_map)

Regression Analysis


In [10]:
def regress(df, f, family = 'linear'):
    """Fit the formula f on df and return the fitted model's coefficient table."""
    if family == 'linear':
        results = sm.ols(formula=f, data=df).fit()
        return results.summary().tables[1]

    elif family == 'logistic':
        results = sm.logit(formula=f, data=df).fit(disp=0)
        return results.summary().tables[1]
    else:
        return
    

def get_latex_table(results, f, family = 'linear'):
    """
    Mess of a function for turning a statsmodels SimpleTable
    into a nice latex table string; the formula f is used as the caption.
    """
    
    results = pd.read_csv(StringIO(results.as_csv()))
    
    if family == 'linear':
        column_map = {
            results.columns[0]: "",
            '   coef   ' : 'coef',
           'P>|t| ': "p-val",
            '    t    ': "z-stat",
           ' [95.0% Conf. Int.]': "95% CI"
        }

    elif family == 'logistic':
        column_map = {
            results.columns[0]: "",
            '   coef   ' : 'coef',
           'P>|z| ': "p-val",
            '    z    ': "z-stat",
           ' [95.0% Conf. Int.]': "95% CI"
        }
    else:
        return
        
        
    results = results.rename(columns=column_map)
    results.index = results[""]
    del results[""]
    results = results[['coef', "z-stat", "p-val", "95% CI"]]
    results['coef'] = results['coef'].apply(lambda x: round(float(x), 2))
    results['z-stat'] = results['z-stat'].apply(lambda x: round(float(x), 1))
    results['p-val'] = results['p-val'].apply(lambda x: round(float(x), 3))
    results['95% CI'] = results['95% CI'].apply(reformat_ci)
    header = """
\\begin{table}[h]
\\begin{center}
    """
    footer = """
\\end{center}
\\caption{%s}
\\label{tab:}
\\end{table}
    """
    f = f.replace("_", "\\_").replace("~", "\\texttildelow\\")
    latex = header + results.to_latex() + footer % f
    print(latex)
    return results
        
    
def reformat_ci(s):
    ci = s.strip().split()
    ci = (round(float(ci[0]), 1), round(float(ci[1]), 1))
    return "[%.1f, %.1f]" % ci

RQ1: Do newcomers in general show reduced activity after experiencing harassment?


In [11]:
f ="m2_days_active ~ m1_received_harassment"
regress(df_reg, f)


Out[11]:
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.2052 0.007 31.109 0.000 0.192 0.218
m1_received_harassment 2.8905 0.029 100.519 0.000 2.834 2.947

In [12]:
f= "m2_days_active ~ m1_days_active + m1_received_harassment"
regress(df_reg, f)


Out[12]:
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.7172 0.005 -131.834 0.000 -0.728 -0.706
m1_days_active 0.5945 0.002 326.691 0.000 0.591 0.598
m1_received_harassment -0.3474 0.023 -15.395 0.000 -0.392 -0.303

The first regression shows that newcomers who are harassed in m1 tend to be more active in m2, which would seem to indicate that harassment does not have a chilling effect on continued newcomer activity. However, this result is an artifact of harassed newcomers being more active in general. After controlling for the level of activity in m1, we see that among users of comparable m1 activity, those who are harassed are significantly less active in m2.
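
To make the second model concrete, a small illustrative sketch (the value 5 for m1 activity is arbitrary) that predicts m2 activity for a harassed and a non-harassed newcomer with identical m1 activity:

model = sm.ols("m2_days_active ~ m1_days_active + m1_received_harassment", data=df_reg).fit()
# two otherwise identical newcomers, one harassed in m1 and one not
counterfactuals = pd.DataFrame({'m1_days_active': [5, 5], 'm1_received_harassment': [0, 1]})
print(model.predict(counterfactuals))  # the gap between the two predictions is the harassment coefficient, about -0.35 days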

RQ2: Does a newcomer's gender affect how they behave after experiencing harassment?


In [13]:
f="m1_received_harassment ~ is_female"
regress(df_reg.query("has_gender == 1"), f, family = 'logistic')


Out[13]:
coef std err z P>|z| [95.0% Conf. Int.]
Intercept -1.9697 0.055 -35.684 0.000 -2.078 -1.862
is_female 0.3866 0.123 3.146 0.002 0.146 0.627

In [14]:
f="m2_days_active ~ m1_days_active + m1_received_harassment + m1_received_harassment : is_female"
regress(df_reg.query("has_gender == 1"), f)


Out[14]:
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.9389 0.060 -15.626 0.000 -1.057 -0.821
m1_days_active 0.7568 0.011 66.062 0.000 0.734 0.779
m1_received_harassment -0.8463 0.218 -3.888 0.000 -1.273 -0.420
m1_received_harassment:is_female -0.7046 0.351 -2.007 0.045 -1.393 -0.016

For our gender analysis, we reduce our sample to the set of users who reported a gender. First, we observe that newcomers who report a female gender are more likely to receive harassment in m1. To investigate whether the impact of receiving harassment differs across genders, we ran the same regression as in RQ1, restricted to users who supplied a gender, and added an interaction term between gender and our measure of harassment in m1. Within this subsample, users who received harassment again show reduced activity in m2. The regression results for the interaction term between harassment and gender indicate that the impact is not significantly different for males and females.
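
For readers less used to log-odds, a one-line illustration of how to read the is_female coefficient from Out[13] as an odds ratio:

# exp of the is_female log-odds coefficient from Out[13]
print(round(math.exp(0.3866), 2))  # ~1.47: roughly 47% higher odds of receiving harassment in m1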

RQ3: How do good faith newcomers behave after experiencing harassment?


In [15]:
f="m2_days_active ~ m1_days_active + m1_received_harassment +  m1_received_harassment : m1_made_harassment + m1_received_harassment : m1_warnings"
regress(df_reg, f)


Out[15]:
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.7153 0.005 -131.042 0.000 -0.726 -0.705
m1_days_active 0.5933 0.002 321.065 0.000 0.590 0.597
m1_received_harassment -0.2668 0.025 -10.839 0.000 -0.315 -0.219
m1_received_harassment:m1_made_harassment 0.2411 0.056 4.294 0.000 0.131 0.351
m1_received_harassment:m1_warnings -0.1599 0.011 -13.997 0.000 -0.182 -0.138

A serious potential confound in our analyses is that the users who receive harassment may simply be bad-faith newcomers or sock-puppets: they get attacked for their misbehavior and reduce their activity in m2 because they get blocked, or because they never intended to stick around past their own attacks. To reduce this confound, we control for whether the user harassed anyone in m1 and for whether they received a user warning of any type. The results show that even users who received harassment but did not harass anyone or receive a user warning show reduced activity in m2.
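
An alternative, equivalent-in-spirit robustness check (a sketch only, not run here) is to drop newcomers who themselves harassed or were warned in m1 and refit the RQ1 model on the remaining subsample:

# restrict to newcomers who neither harassed anyone nor received a warning in m1
good_faith = df_reg.query("m1_made_harassment == 0 and m1_warnings == 0")
print(regress(good_faith, "m2_days_active ~ m1_days_active + m1_received_harassment"))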

RQ4: How does experiencing harassment compare to previously studied barriers to newcomer socialization?

Halfaker et al. examine how user warnings, deletions, and reverts correlate with newcomer retention. Here we add those features and see how they compare to our measure of harassment.


In [16]:
f = "m2_days_active ~ m1_days_active +  m1_fraction_ns0_deleted + m1_fraction_ns0_reverted "
regress(df_reg.query("m1_num_ns0_edits > 0"), f)


Out[16]:
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.7493 0.009 -80.887 0.000 -0.767 -0.731
m1_days_active 0.5965 0.002 300.840 0.000 0.593 0.600
m1_fraction_ns0_deleted -0.0926 0.036 -2.543 0.011 -0.164 -0.021
m1_fraction_ns0_reverted -0.0579 0.015 -3.832 0.000 -0.088 -0.028

In [17]:
f = "m2_days_active ~ m1_days_active + m1_received_harassment + m1_warnings +  m1_fraction_ns0_deleted + m1_fraction_ns0_reverted "
regress(df_reg.query("m1_num_ns0_edits > 0"), f)


Out[17]:
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.7628 0.009 -81.987 0.000 -0.781 -0.745
m1_days_active 0.6130 0.002 269.827 0.000 0.609 0.617
m1_received_harassment -0.3963 0.031 -12.589 0.000 -0.458 -0.335
m1_warnings -0.0872 0.007 -12.498 0.000 -0.101 -0.073
m1_fraction_ns0_deleted -0.0807 0.036 -2.223 0.026 -0.152 -0.010
m1_fraction_ns0_reverted 0.0340 0.016 2.111 0.035 0.002 0.066

WIP: Receiving harassment is worse for a newcomer than receiving 11 warning messages or having all of their first month's work deleted or reverted.